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£S) i Abstract 

Many data analysis applications deal with large matrices and involve approximating the 
matrix using a small number of "components." Typically, these components are linear combi- 
nations of the rows and columns of the matrix, and are thus difficult to interpret in terms of 
' the original features of the input data. In this paper, we propose and study matrix approxi- 

£N| , mations that are explicitly expressed in terms of a small number of columns and/or rows of 

the data matrix, and thereby more amenable to interpretation in terms of the original data. 
, Our main algorithmic results are two randomized algorithms which take as input an m x n 

■ matrix A and a rank parameter k. In our first algorithm, C is chosen, and we let A' = CC + A, 
where C + is the Moore-Penrose generalized inverse of C. In our second algorithm C, U, R 
are chosen, and we let A' = CUR. (C and R are matrices that consist of actual columns 

i— — i, and rows, respectively, of A, and U is a generalized inverse of their intersection.) For each 

algorithm, we show that with probability at least 1 — 8 

\\A-A'\\ F <(l + e)\\A-A k \\ F , 

, where Ak is the "best" rank-fc approximation provided by truncating the singular value decom- 

£T) • position (SVD) of A, and where ||^|| F is the Frobenius norm of the matrix X. The number 

\ of columns of C and rows of R is a low-degree polynomial in k, 1/e, and log(l/<5). Both 

■ the Numerical Linear Algebra community and the Theoretical Computer Science community 
f**» , have studied variants of these matrix decompositions over the last ten years. However, our two 

algorithms are the first polynomial time algorithms for such low-rank matrix approximations 
, that come with relative-error guarantees; previously, in some cases, it was not even known 

whether such matrix decompositions exist. Both of our algorithms are simple and they take 
r% | time of the order needed to approximately compute the top k singular vectors of A. 

The technical crux of our analysis is a novel, intuitive sampling method we introduce 
in this paper called "subspace sampling." In subspace sampling, the sampling probabilities 
depend on the Euclidean norms of the rows of the top singular vectors. This allows us to 
obtain provable relative-error guarantees by deconvoluting "subspace" information and "size- 
of-A" information in the input matrix. This technique is likely to be useful for other matrix 
approximation and data analysis problems. 
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1 Introduction 



Large m x n matrices are common in applications since the data often consist of m objects, each 
of which is described by n features. Examples of object-feature pairs include: documents and 
words contained in those documents; genomes and environmental conditions under which gene 
responses are measured; stocks and their associated temporal resolution; hyperspectral images 
and frequency resolution; and web groups and individual users. In each of these application areas, 
practitioners spend vast amounts of time analyzing the data in order to understand, interpret, 
and ultimately use this data for some application-specific task. 

Say that A is the m x n data matrix. In many cases, an important step in data analysis is 
to construct a compressed representation of A that may be easier to analyze and interpret. The 
most common such representation is obtained by truncating the Singular Value Decomposition 
(SVD) at some number k <C min{m,n} terms, in large part because this provides the "best" 
rank-fc approximation to A when measured with respect to any unitarily invariant matrix norm. 
Unfortunately, the basis vectors (the so-called eigencolumns and eigenrows) provided by this 
approximation (and with respect to which every column and row of the original data matrix is 
expressed) are notoriously difficult to interpret in terms of the underlying data and processes 
generating that data. For example, the vector [(1/2) age - (l/s/2) height + (1/2) income], being 
one of the significant uncorrelated "factors" from a dataset of people's features is not particularly 
informative. It would be highly preferable to have a low-rank approximation that is nearly as 
good as that provided by the SVD but that is expressed in terms of a small number of actual 
columns and/or actual rows of a matrix, rather than linear combinations of those columns and 
rows. 

The main contribution of this paper is to provide such decompositions. In particular, we 
provide what we call a relative-error CUR matrix decomposition: given an m x n matrix A, we 
decompose it as a product of three matrices, C, U, and R, where C consists of a small number of 
actual columns of A, R consists of a small number of actual rows of A, and U is a small carefully 
constructed matrix that guarantees that the product CUR is "close" to A. In fact, CUR will 
be nearly as good as the best low-rank approximation to A that is traditionally used and that is 
obtained by truncating the SVD. Hence, the columns of A that are included in C, as well as the 
rows of A that are included in R, can be used in place of the eigencolumns and eigenrows, with 
the added benefit of improved interpretability in terms of the original data. 

Before describing applications of our main results in the next subsection, we would like to 
emphasize that two research communities, the Numerical Linear Algebra (NLA) community and 
the Theoretical Computer Science (TCS) community, have provided significant practical and 
theoretical motivation for studying variants of these matrix decompositions over the last ten 
years. In Section 3, we provide a detailed treatment of relevant prior work in both the NLA and 
the TCS literature. The two algorithms presented in this paper are the first polynomial time 
algorithms for such low-rank matrix approximations that come with relative-error guarantees; 
previously, in some cases, it was not even known whether such matrix decompositions exist. 

1.1 Applications 

As an example of this preference for having the data matrix expressed in terms of a small number 
of actual columns and rows of the original matrix, as opposed to a small number of eigencolumns 
and eigenrows, consider recent data analysis work in DNA microarray and DNA Single Nucleotide 
Polymorphism (SNP) analysis [44, 47, 52]. DNA SNP data are often modeled as an m x n matrix 
A, where m is the number of individuals in the study, n is the number of SNPs being analyzed, and 
Aij is an encoding of the j-th SNP value for the i-th individual. Similarly, for DNA microarray 
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data, m is the number of genes under consideration, n is the number of arrays or environmental 
conditions, and Aij is the absolute or relative expression level of the i-th gene in the j-th environ- 
mental condition. Biologists typically have an understanding of a single gene that they fail to have 
about a linear combination of 6000 genes (and also similarly for SNPs, individuals, and arrays); 
thus, recent work in genetics on DNA microarray and DNA SNP data has focused on heuristics 
to extract actual genes, environmental conditions, individuals, and SNPs from the eigengenes, 
eigenconditions, eigenpeople, and eigenSNPs computed from the original data matrices [44, 47]. 1 
Our CUR matrix decomposition is a direct formulation of this problem: determine a small num- 
ber of actual SNPs that serve as a basis with which to express the remaining SNPs, and a small 
number of individuals to serve as a basis with which to express the remaining individuals. In fact, 
motivated in part by this, we have successfully applied a variant of the CUR matrix decomposition 
presented in this paper to intra- and inter-population genotype reconstruction from tagging SNPs 
in DNA SNP data from a geographically-diverse set of populations [52]. In addition, we have 
applied a different variant of our CUR matrix decomposition to hyperspectrally-resolved medical 
imaging data [48]. In this application, a column corresponds to an image at a single physical 
frequency and a row corresponds to a single spectrally-resolved pixel, and we have shown that 
data reconstruction and classification tasks can be performed with little loss in quality even after 
substantial data compression [48]. 

A quite different motivation for low-rank matrix approximations expressed in terms of a small 
number of columns and/or rows of the original matrix is to decompose efficiently large low- 
rank matrices that possess additional structure such as sparsity or non-negativity. This often 
arises in the analysis of, e.g., large term-document matrices [58, 59, 8]. Another motivation 
comes from statistical learning theory, where the data need not even be elements in a vector 
space, and thus expressing the Gram matrix in terms of a small number of actual data points 
is of interest [64, 63, 24, 25]. This procedure has been shown empirically to perform well for 
approximate Gaussian process classification and regression [64], to approximate the solution of 
spectral partitioning for image and video segmentation [32], and to extend the eigenfunctions 
of a data-dependent kernel to new data points [7, 45]. Yet another motivation is provided by 
integral equation applications [40, 39, 38], where large coefficient matrices arise that have blocks 
corresponding to regions where the kernel is smooth and that are thus well-approximated by low- 
rank matrices. In these applications, partial SVD algorithms can be expensive, and a description 
in terms of actual columns and/or rows is of interest [39, 38]. A final motivation for studying 
matrix decompositions of this form is to obtain low-rank matrix approximations to extremely 
large matrices where a computation of the SVD is too expensive [33, 34, 21, 22, 23]. 

1.2 Our Main Results 

Our main algorithmic results have to do with efficiently computing low-rank matrix approxima- 
tions that are explicitly expressed in terms of a small number of columns and/or rows of the input 
matrix. We start with the following definition. 

Definition 1 Let A be an m x n matrix. For any given C, an m x c matrix whose columns 
consist of c columns of the matrix A, the m x n matrix A' = CX is a column-based matrix 

For example, in their review article "Vector algebra in the analysis of genome- wide expression data" [44], 
which appeared in Genome Biology, Kuruvilla, Park, and Schreiber describe many uses of the vectors provided 
by the SVD and PCA in DNA microarray analysis. The three biologists then conclude by stating that: "While 
very efficient basis vectors, the vectors themselves are completely artificial and do not correspond to actual (DNA 
expression) profiles. ... Thus, it would be interesting to try to find basis vectors for all experiment vectors, using 
actual experiment vectors and not artificial bases that offer little insight." That is, they explicitly state that they 
would like decompositions of the form we provide in this paper! 
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approximation to A, or CX matrix decomposition, for any c x n matrix X. 

Several things should be noted about this definition. First, we will be interested in c <C n in 
our applications. For example, depending on the application, c could be constant, independent 
of n, logarithmic in the size of n, or simply a large constant factor less than n. Second, a CX 
matrix decomposition expresses each of the columns of A in terms of a linear combination of 
"dictionary elements" or "basis columns," each of which is an actual column of A. Thus, a CX 
matrix decomposition provides a low-rank approximation to the original matrix, although one 
with structural properties that are quite different than those provided by the SVD. Third, given 
a set of columns C, the approximation A' = PqA = CC + A (where PqA is the projection of A 
onto the subspace spanned by the columns of C and C + is the Moore-Penrose generalized inverse 
of C, as defined in Section 2) clearly satisfies the requirements of Definition 1. Indeed, this is the 
"best" such approximation to A, in the sense that \\A — C {C + A) \\ F = minxgRcxn \\A — CX\\ F . 
Our first main result is the following. 

Theorem 1 Given a matrix A £ flj mxn an d an integer k <C min{m,n}, there exist randomized 
algorithms such that either exactly c = 0(k 2 log(l/5)/e 2 ) columns of A are chosen to construct 
C, or c = 0(/elog A;log(l/5)/e 2 ) columns are chosen in expectation to construct C , such that with 
probability at least 1 — 6, 

min \\A-CX\\ F = \\A-CC + A\\„< (1 + e) \\A - A k \\ F . (1) 

Here, C is a matrix consisting of the chosen columns of A, CC + A is the projection of A on the 
subspace spanned by the chosen columns, and Ak is the best rank-k approximation to A. Both 
algorithms run in time 0(SVD(A,k)), which is the time required to compute the best rank-k 
approximation to the matrix A [37]. 

Note that we use c > k and have an e error, which allows us to take advantage of linear algebraic 
structure in order to obtain an efficient algorithm. In general, this would not be the case if, given 
an m x n matrix A, we had specified a parameter k and asked for the "best" subset of k columns, 
where "best" is measured, e.g., by maximizing the Frobenius norm captured by projecting onto 
those columns or by maximizing the volume of the parallelepiped defined by those columns. Also, 
it is not clear a priori that C with properties above even exists; see the discussion in Sections 3.2 
and 3.3. Finally, our result does not include any reference to regularization or conditioning, as 
is common in certain application domains; a discussion of similar work on related problems in 
numerical linear algebra may be found in Section 3.1. 

Our second main result extends the previous result to CUR matrix decompositions. 

Definition 2 Let A be an m x n matrix. For any given C , an m x c matrix whose columns 
consist of c columns of the matrix A, and R, an r x n matrix whose rows consist of r rows of the 
matrix A, the m x n matrix A' = CUR is a column-row-based matrix approximation to A, or 
CUR matrix decomposition, for any c x r matrix U. 

Several things should be noted about this definition. First, a CUR matrix decomposition is a 
CX matrix decomposition, but one with a very special structure, i.e., every column of A can be 
expressed in terms of the basis provided by C using only the information contained in a small 
number of rows of A and a low-dimensional encoding matrix. Second, in terms of its singular 
value structure, U must clearly contain "inverse-of-^4" information. For the CUR decomposition 
described in this paper, U will be a generalized inverse of the intersection between C and R. 
More precisely, if C = AS C D C and R = DrS^A then U = {D R S^AS C D C ) + ■ (See Section 2 
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for a review of linear algebra and notation, such as that for Sc, Dc, Sr, and Dr.) Third, the 
combined size of C, U and R is 0(mc + rn + cr), which is an improvement over A's size of 0{mn) 
when c, r <C n, m. Finally, note the structural simplicity of a CUR matrix decomposition: 




raxn mxc 



Our main result for CUR matrix decomposition is the following. 

Theorem 2 Given a matrix A £ w nxn an d an integer k <C min{m, n}, there exist randomized al- 
gorithms such that exactly c = 0(k 2 log(l/<5)/e 2 ) columns of A are chosen to construct C , and then 
exactly r = 0(c 2 log(l/(5)/e 2 ) rows of A are chosen to construct R, or c = 0(k log k log(l/5)/e 2 ) 
columns of A in expectation are chosen to construct C, and then r = 0(clogclog(l/5)/e 2 ) rows 
of A in expectation are chosen to construct R, such that such that with probability at least 1 — 5, 

\\A-CUR\\ F <{l + e)\\A- A k \\ F . (3) 

Here, the matrix U is a weighted Moore-Penrose inverse of the intersection between C and R, and 
Ak is the best rank-k approximation to A. Both algorithms run in time 0(SVD(A,k)), which is 
the time required to compute the best rank-k approximation to the matrix A [37]. 



1.3 Summary of Main Technical Result 

The key technical insight that leads to the relative-error guarantees is that the columns are 
selected by a novel sampling procedure that we call "subspace sampling." Rather than sample 
columns from A with a probability distribution that depends on the Euclidean norms of the 
columns of A (which gives provable additive-error bounds [21, 22, 23]), in "subspace sampling" 
we randomly sample columns of A with a probability distribution that depends on the Euclidean 
norms of the rows of the top k right singular vectors of A. This allows us to capture entirely a 
certain subspace of interest. Let Va k be the n x k matrix whose columns consist of the top k 
right singular vectors of A. The "subspace sampling" probabilities pi,i € [n] will satisfy 



Pi> - 



(vk, fc ) w 



A: 



Vi€[n], (4) 



for some (3 € (0,1], where (V^fc)^) is the i-th row of Va^- That is, we will sample based on 
the norms of the rows (not the columns) of the truncated matrix of singular vectors. Note that 

2 

YJj=i (YA,k)tj\ = k and that Yli^[ n }Pi = 1- To construct sampling probabilities satisfying 
Condition (4), it is sufficient to spend 0(SVD(A, k)) time to compute (exactly or approximately, 
in which case (3 = 1 or f3 < 1, respectively) the top k right singular vectors of A. Sampling 
probabilities of this form will allow us to deconvolute subspace information and "size-of-^4" infor- 
mation in the input matrix A, which in turn will allow us to obtain the relative-error guarantees 
we desire. Note that we have used this method previously [29], but in that case the sampling 
probabilities contained other terms that complicated their interpretation. 

We will use these "subspace sampling" probabilities in our main technical result, which is a 
random sampling algorithm for approximating the following generalized version of the standard 
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£2 regression problem. Our main column/row-based approximation algorithmic results will follow 
from this result. Given as input a matrix A £ M. mxn that has rank no more than k and a matrix 
of target vectors B € M mxp , compute 

Z = min \\B-AX\\ F . (5) 

That is, fit every column of the matrix B to the basis provided by the columns of the rank-A; 
matrix A. Also of interest is the computation of 

X opt = A+B. (6) 

The main technical result of this paper is a simple sampling algorithm that represents the matrices 
A and B by a small number of rows so that this generalized £2 regression problem can be solved 
to accuracy 1 ± e for any e > 0. 

More precisely, we present and analyze an algorithm (Algorithm 3 of Section 6) that constructs 
and solves an induced subproblem of the generalized £2 regression problem of Equations (5) and 
(6). Let DS T A be the r x n matrix consisting of the sampled and appropriately rescaled rows 
of the original matrix A, and let DS B be the r x p matrix consisting of the sampled and 
appropriately rescaled rows of B. Then consider the problem 

Z = min \\DS T B -DS T AX\\ T? . (7) 

xeR nx p 

The "smallest" matrix X op t £ W ixp among those that achieve the minimum value Z in this 
sampled £2 regression problem is 

X opt = (DS T A) + DS T B. (8) 

Since we will sample a number of rows rCmof the original problem, we will compute (8), and 
thus (7), exactly. Our main theorem, Theorem 5, states that under appropriate assumptions on 
the original problem and on the sampling probabilities, the computed quantities Z and X opt will 
provide very accurate relative-error approximations to the exact solution Z and the optimal vector 
X opt . Rows will be sampled with one of two random sampling procedures. In one case, exactly 
r = 0(k 2 /e 2 ) rows are chosen, and in the other case, r = 0(k\ogk/e 2 ) rows in expectation are 
chosen. In either case, the most expensive part of the computation involves the computation of 
the Euclidean norms of the rows of the right singular vectors of A which are used in the sampling 
probabilities. 

1.4 Outline of the Remainder of the Paper 

In the next two sections, we provide a review of relevant linear algebra, and we discuss related 
work. Then, in Sections 4 and 5, we present in detail our main algorithmic results. In Section 4, we 
describe our main column-based matrix approximation algorithm, and in Section 5, we describe 
our main column-row-based matrix approximation algorithm. Then, in Section 6, we present 
an approximation algorithm for generalized £2 regression. This is our main technical result, 
and from it our two main algorithmic results will follow. Finally, in Section 7 we present an 
empirical evaluation of our algorithms, and in Section 8 we present a brief conclusion. We devote 
Appendix A to two prior algorithms for approximate matrix multiplication. These two algorithms 
select columns and rows in a complementary manner, and they are essential in the proof of our 
main results. 
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2 Review of Linear Algebra 



In this section, we provide a review of linear algebra that will be useful throughout the paper; 
for more details, see [50, 43, 60, 37, 9, 6]. We also review a sampling matrix formalism that will 
be convenient in our discussion [21]. 

Let [n] denote the set {1,2, . . . , n}. For any matrix A G R mxn , let A^,i G [m] denote the 
i-th row of A as a row vector, and let A^\j G [n] denote the j-ih column of A as a column 
vector. In addition, let H^H^ = Y^Li Sj=i ^Hj denote the square of its Frobenius norm, and let 
1 1 ^4 1 1 2 = sup^jjn^ x -£ \Ax\ 2 I \x\ 2 denote its spectral norm. These norms satisfy: ||vl|| 2 < H^Hf — 
\J min{m, n} ||-A|| 2 for any matrix A; and also ||AB||^ < ||j4||^ ||-B|| 2 for any matrices A and B. 

If A G R mxn , then there exist orthogonal matrices U = [u l v? . . .u m ] G n mxm and V = 
[v V . . . v n ] G R nxn such that U T AV = £ = diag^, . . . , cr f ), where S G R mxn , £ = mm{m, n} 
and o\ > cr 2 > . . . > > 0. Equivalently, A = UY,V T . The three matrices U, V, and S 
constitute the Singular Value Decomposition (SVD) of A. The crj are the singular values of A, 
the vectors u 1 , v l are the i-th left and the i-th right singular vectors of A, respectively, and the 
condition number of A is k{A) = a max (A) / cr m i n (A) . If k < r = rank(A), then the SVD of A may 
be written as 

A = Ua^aVI = [ U k Ujt ] 

Here, is the k x k diagonal matrix containing the top k singular values of A, and is 
the (r — k) x (r — k) diagonal matrix containing the bottom r — k nonzero singular values of A. 
Also, V fc T is the k x n matrix whose rows are the top k right singular vectors of A, is the 

(r — k) x n matrix whose rows are the bottom r — k right singular vectors of A, and Ut and 
are defined similarly. If we define A^ = Uk^kViF, then the distance (as measured by both ||-|| 2 
and ||-||^) between A and any rank k approximation to A is minimized by A].. We will denote by 
0(SVD(A, k)) the time required to compute the best rank-fc approximation to the matrix A [37]. 
Finally, for any orthogonal matrix U G IR mxc , let U 1 - G J£ mx ( m - C ) denote an orthogonal matrix 
whose columns are an orthonormal basis spanning the subspace of W 71 that is orthogonal to the 
column space of U. 

Given a matrix A G R mxn , the unweighted Moore-Penrose generalized inverse of A, denoted 
by A + , is the unique n x m matrix that satisfies the four Moore-Penrose conditions [50, 6]. In 
terms of the SVD this generalized inverse may be written as A + = V/iE^C/J (where the square 
diagonal rank(A) x rank(A) matrix S^4, as in (9), is invertible by construction). If, in addition, 
D\ G W nxm and D 2 G R nxn are diagonal matrices with positive entries along the diagonal, then 
the {D\, Z) 2 }-Moore-Penrose generalized inverse of A, denoted by Af Di D ^y is a generalization of 
the Moore-Penrose inverse that can be expressed in terms of the unweighted generalized inverse 
of A as At Di D ^ = D 2 (DlJ 2 AD 2 1 ^ 2 )+ d\^ 2 . Also, in terms of the generalized inverse, the 

projection onto the column space of any matrix A may be written as Pa = AA + . 

Since our main algorithms will involve sampling columns and/or rows from input matrices 
(using one of two related random sampling procedures described in Appendix A), we conclude 
this subsection with a brief review of a sampling matrix formalism that was introduced in [21] 
and with respect to which our sampling matrix operations may be conveniently expressed. First, 
assume that d (= c exactly) columns of A are chosen in c i.i.d. trials by randomly sampling 
according to a probability distribution {pj}™ =1 with the Exactly(c) algorithm (described in 
detail in Appendix A), and assume that the ij-th column of A is chosen in the i-th (for t = 1, . . . , c) 
independent random trial. Then, define the sampling matrix S G IR nxc to be the zero-one matrix 
where S^t = 1 and Sij = otherwise, and define the diagonal rescaling matrix D G M. cxc to 



" S fc 






o s fei± _ 




v k ±T 



(9) 
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be the diagonal matrix with D tt = l/y/cpi t , where pi t is the probability of choosing the it-th 
column. Alternatively, assume that d (< c in expectation) columns of A are chosen with the 
Expected (c) algorithm (also described in detail in Appendix A) by including the i-th column of 
A in C with probability pi = min{l, cpi}. Then, define the sampling matrix S 6 M nxn to be the 
zero-one matrix where Sa = 1 if the i-th column is chosen and Sy = otherwise, and define the 
rescaling matrix D £ W IXC to be the matrix with Dij = 1/ \/cpj if i — 1 of the previous columns 
have been chosen and = otherwise. Clearly, in both of these cases, C = ASD is an m x c' 
matrix consisting of sampled and rescaled copies of the columns of A, and R = (SD) T A = DS T A 
is a c' x n matrix consisting of sampled and rescaled copies of the rows of A. In certain cases, 
we will subscript S and D with C or R (e.g., C = AScDc and R = DrS^A) to make explicit 
that the corresponding sampling and rescaling matrices are operating on the columns or rows, 
respectively, of A. 

3 Relationship with Previous Related Work 

In this section, we discuss the relationship between our results and related work in numerical 
linear algebra and theoretical computer science. 

3.1 Related Work in Numerical Linear Algebra 

Within the numerical linear algebra community, several groups have studied matrix decomposi- 
tions with similar structural, if not algorithmic, properties to the CX and CUR matrix decompo- 
sitions we have defined. Much of this work is related to the QR decomposition, originally used 
extensively in pivoted form by Golub [36, 11]. 

Stewart and collaborators were interested in computing sparse low-rank approximations to 
large sparse term-document matrices [58, 59, 8]. He developed the quasi-Gram-Schmidt method. 
This method is a variant of the QR decomposition which, when given as input an m x n matrix 
A and a rank parameter k, returns an m x k matrix C consisting of k columns of A whose span 
approximates the column space of A and also a nonsingular upper-triangular k x k matrix Tc 
that orthogonalizes these columns (but it does not explicitly compute the nonsparse orthogonal 
matrix Qc = CT^ 1 ). This provides a matrix decomposition of the form A ~ CX. By applying 
this method to A to obtain C and to A T to obtain an k x n matrix R consisting of k rows of 
A, one can show that A « CUR, where the matrix U is computed to minimize \\A — CUR\\ 2 F . 
Although provable approximation guarantees of the form we present were not provided, backward 
error analysis was performed and the method was shown to perform well empirically [58, 59, 8]. 

Goreinov, Tyrtyshnikov, and Zamarashkin [39, 38, 61] were interested in applications such 
as scattering, in which large coefficient matrices have blocks that can be easily approximated 
by low-rank matrices. They show that if the matrix A is approximated by a rank-fc matrix to 
within an accuracy e then there exists a choice of k columns and k rows, i.e., C and R, and a 
low-dimensional k x k matrix U constructed from the elements of C and R, such that A ~ CUR 
in the sense that \\A — CUR\\ 2 < ef(m,n,k), where f(m,n,k) = 1 + 2\Jkm + 2\fhn. In [39], 
the choice for these matrices is related to the problem of determining the minimum singular 
value dfc of k x k submatrices of n x k orthogonal matrices. In addition: in [38] the choice 
for C and R is interpreted in terms of the maximum volume concept from interpolation theory, 
in the sense that columns and rows should be chosen such that their intersection W defines a 
parallelepiped of maximum volume among all k x k submatrices of A; and in [61] an empirically 
effective deterministic algorithm is presented which ensures that U is well-conditioned. 

Gu and Eisenstat, in their seminal paper [40], describe a strong rank-revealing QR factoriza- 
tion that deterministically selects exactly k columns from an m x n matrix A. The algorithms of 
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[40] are efficient, in that their running time is 0(mn 2 ) (assuming that m> n), which is essentially 
the time required to compute the SVD of A. In addition, Gu and Eisenstat prove that if the mxk 
matrix C contains the k selected columns (without any rescaling), then a m i n (C) > ak(A)/ f(k,n), 
where f(k,n) = 0(y / k(n — k)). Thus, the columns of C span a parallelepiped whose volume 
(equivalently, the product of the singular values of C) is "large." Currently, we do not know 
how to convert this property into a statement similar to that of Theorem 1, although perhaps 
this can be accomplished by relaxing the number of columns selected by the algorithms of [40] to 
0{poly(k, 1/e)). For related work prior to Gu and Eisenstat, see Chan and Hansen [12, 13]. 

Finally, very recently, Martinsson, Rokhlin, and Tygert [49] proposed another related method 
to efficiently compute an approximation to the best rank-A; approximation of an m x n matrix A. 
The heart of their algorithm is a random projection method, which projects A to a small number 
- say £ - of random vectors; the entries of these random vectors are i.i.d. Gaussians of zero mean 
and unit variance. The general form of their bounds is quite complicated, but by setting, e.g., 
£ = k + 20, they construct a rank-/c approximation A' to A such that 

\\A - A'\\ 2 < 10y/(k + 20) m \\A - A k \\ 2 (10) 

holds with probability at least 1 — 10~ 17 . In addition, the authors extend their algorithm to 
compute the so-called interpolative decomposition of a matrix A. This decomposition is explicitly 
expressed in terms of a small number of columns of A, and is a more restrictive version of our 
CX matrix decomposition. More specifically, it additionally requires that every entry of X is 
bounded in absolute value by a small constant (e.g., two). Thus, their algorithm computes an 
interpolative approximation A' = CX to A, where C has only £ = k + 20 columns - as opposed 
to the 0(k log k) columns that are necessary in our work - and satisfies the bound of (10). Notice 
that their work provides bounds for the spectral norm, whereas our work focuses only on the 
Frobenius norm. However, their bounds are much weaker than our relative error bounds, since 
^Jm{k + 20) \A — Afc|| 2 might in general be larger even than ||^4|| F . 

3.2 Related Work in Theoretical Computer Science 

Within the theory of algorithms community, much research has followed the seminal work of 
Frieze, Kannan, and Vempala [33, 34]. Their work may be viewed, in our parlance, as sampling 
columns from a matrix A to form a matrix C such that \\A — CX\\ F < \\A — Ak\\ F + e ||^4||^- 
The matrix C has poly(k,l/e,l/5) columns and is constructed after making only two passes 
over A using 0(m + n) work space. Under similar resource constraints, a series of papers have 
followed [33, 34] in the past seven years [19, 22, 55], improving the dependency of c on k, 1/e, and 
1/8, and analyzing the spectral as well as the Frobenius norm, yielding bounds of the form 

\\A-CX\\^< \\A- A k \\^ + e\\A\\ F (11) 

for £ = 2, F, and thus providing additive-error guarantees for column-based low-rank matrix 
approximations . 

Additive-error approximation algorithms for CUR matrix decompositions have also been an- 
alyzed by Drineas, Kannan, and Mahoney [20, 21, 22, 23, 24, 25]. In particular, in [23], they 
compute an approximation to an m x n matrix A by sampling c columns and r rows from A 
to form m x c and r x n matrices C and R, respectively. From C and R, a c x r matrix U is 
constructed such that under appropriate assumptions 

\\A-CUR\\£< \\A-A k \\ ( + e\\A\\ F , (12) 

with high probability, for both the spectral and Frobenius norms, £ = 2, F. In [24, 25], it is 
further shown that if A is a symmetric positive semidefinite (SPSD) matrix, then one can choose 
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R = C T and U = W + , where W is the c x c intersection between C and R = C T , thus obtaining 
an approximation A ~ A' = CW + C T . This approximation is SPSD and has provable bounds of 
the form (12), except that the scale of the additional additive error is somewhat larger [24, 25]. 

Most relevant for our relative-error CX and CUR matrix decomposition algorithms is the 
recent work of Rademacher, Vempala and Wang [53] and Deshpande, Rademacher, Vempala 
and Wang [17]. Using two different methods (in one case iterative sampling in a backwards 
manner and an induction on k argument [53] and in the other case an argument which relies on 
estimating the volume of the simplex formed by each of the &:-sized subsets of the columns [17]), 
they reported the existence of a set of Oik? je 2 ) columns that provide relative-error CX matrix 
decomposition. No algorithmic result was presented, except for an exhaustive algorithm that ran 
in Q(n k ) time. Note that their results did not apply to columns and rows simultaneously. Thus, 
ours is the first CUR matrix decomposition algorithm with relative error, and it was previously not 
even known whether such a relative-error CUR representation existed, i.e., it was not previously 
known whether columns and rows satisfying the conditions of Theorem 2 existed. 

Other related work includes that of Rudelson and Vershynin [54, 62, 56], who provide an 
algorithm for CX matrix decomposition which has an improved additive error spectral norm 
bound of the form 



Their proof uses an elegant result on random vectors in the isotropic position [54], and since 
we use a variant of their result, it is described in more detail in Appendix A. Achlioptas and 
McSherry have computed low-rank matrix approximations using sampling techniques that involve 
zeroing-out and/or quantizing individual elements [2, 1]. The primary focus of their work was 
in introducing methods to accelerate orthogonal iteration and Lanczos iteration methods, and 
their analysis relied heavily on ideas from random matrix theory [2, 1]. Agarwal, Har-Peled, and 
Varadarajan have analyzed so-called "core sets" as a tool for efficiently approximating various 
extent measures of a point set [3, 4]. The choice of columns and/or rows we present are a "core 
set" for approximate matrix computations; in fact, our algorithmic solution to Theorem 1 solves 
an open question in their survey [4]. The choice of columns and rows we present may also 
be viewed as a set of variables and features chosen from a data matrix [10, 14, 41]. "Feature 
selection" is a broad area that addresses the choice of columns explicitly for dimension reduction, 
but the metrics there are typically optimization-based [14] or machine-learning based [10]. These 
formulations tend to have set-cover like solutions and are incomparable with the linear-algebraic 
structure such as the low-rank criteria we consider here that is common among data analysts. 

3.3 Very Recent Work on Relative-Error Approximation Algorithms 

To the best of our knowledge, the first nontrivial algorithmic result for relative-error low-rank 
matrix approximation was provided by a preliminary version of this paper [27, 28]. In particular, 
an earlier version of Theorem 1 provided the first known relative-error column-based low-rank 
approximation in polynomial time [27, 28]. The major difference between our Theorem 1 and 
our result in [27, 28] is that the sampling probabilities in [27, 28] are more complicated. (See 
Section 6.2 for details on this.) The algorithm from [27, 28] runs in 0(SVD(A, k)) time (although 
it was originally reported to run in only 0(SVD(A)) time), and it has a sampling complexity of 
Oik 2 log(l/(5)/e 2 ) columns. 

Subsequent to the completion of the preliminary version of this paper [27, 28], several devel- 
opments have been made on relative-error low-rank matrix approximation algorithms. First, Har- 
Peled reported an algorithm that takes as input an m x n matrix A, and in roughly 0(mnk 2 log k) 
time returns as output a rank-fc matrix A' with a relative-error approximation guarantee [42]. His 



A - CX\\ 2 < \\A - A k \\ 2 + eJ \\A\\ 2 \\A 
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algorithm uses geometric ideas and involves sampling and merging approximately optimal fc-flats; 
it is not clear if this approximation can be expressed in terms of a small number of columns 
of A. Then, Deshpande and Vempala [18] reported an algorithm that takes as input an m x n 
matrix A that also returns a relative-error approximation guarantee. Their algorithm extends 
ideas from [53, 17], and it leads to a CX matrix decomposition consisting of O(klogk) columns 
of A. The complexity of their algorithm is 0(Mk 2 log k), where M is the number of nonzero 
elements of A, and their algorithm can be implemented in a data streaming framework with 
O(klogk) passes over the data. In light of these developments, we simplified and generalized our 
preliminary results [27, 28], and we performed a more refined analysis to improve our sampling 
complexity to 0(k log k). Most recently, we learned of work by Sarlos [57], who used ideas from 
the recently developed fast Johnson-Lindenstrauss transform of Ailon and Chazelle [5] to yield 
further improvements to a CX matrix decomposition. 



4 Our Main Column-Based Matrix Approximation Algorithm 

In this section, we describe an algorithm and a theorem, from which our first main result, Theo- 
rem 1, will follow. 



4.1 Description of the Algorithm 

Algorithm 1 takes as input an m x n matrix A, a rank parameter k, and an error parameter 
e. It returns as output an m x c matrix C consisting of a small number of columns of A. The 
algorithm is very simple: sample a small number of columns according to a carefully-constructed 
nonuniform probability distribution. Algorithm 1 uses the sampling probabilities 



m= k 



1 2 



but it will be clear from the analysis of Section 6 that any sampling probabilities such that 

W 2 



K) 



/k, for some (3 G (0,1], will also work with a small /3-dependent loss in 

2 

accuracy. Note that Algorithm 1 actually consists of two related algorithms, depending on how 
exactly the columns are chosen. The Exactly (c) algorithm picks exactly c columns of A to be 
included in C in c i.i.d. trials, where in each trial the i-th column of A is picked with probability 
Pi. The Expected (c) algorithm picks in expectation at most c columns of A to create C, by 
including the i-th column of A in C with probability min{l,q?j}. See Algorithms 4 and 5 in 
Appendix A for more details about these two column-sampling procedures. 

The running time of Algorithm 1 is dominated by the computation of the sampling proba- 
bilities (13), for which 0(SVD(A,k)) time suffices. The top k right singular vectors of A can 
be efficiently (approximately) computed using standard algorithms [37, 51]. The building block 
of these algorithms is a series of matrix-vector multiplications, where the input matrix A is it- 
er atively multiplied with a changing set of k orthogonal vectors. In each iteration (which can 
be implemented by making passes over the input matrix A) , the accuracy of the approximation 
improves. Even though the number of iterations required to bound the error depends on quan- 
tities such as the gap between the singular values of A, these algorithms work extremely well in 
practice. As such, they are often treated as "black boxes" for SVD computation in the theoretical 
computer science literature; see, e.g., [2, 1]. 



4.2 Statement of the Theorem 

Theorem 3 is our main quality-of- approximation result for Algorithm 1. 



10 



Data : A G M mxn , a rank parameter k, and an error parameter e. 
Result : C £R mxc 

• Compute sampling probabilities pi for all z G [n] given by (13); 

• (Implicitly) construct a sampling matrix Sc and a diagonal rescaling matrix Dq 
with the Exactly(c) algorithm or with the Expected(c) algorithm; 

• Construct and return the matrix C = AScDc consisting of a small number of 
rescaled columns of A; 



Algorithm 1: A randomized algorithm for CX matrix decomposition. 

Theorem 3 Let A £ M. mxn , let k be a rank parameter, and let e G (0, 1]. If we set c = 3200fc 2 /e 2 
and run Algorithm 1 by choosing exactly c columns from A with the Exactly(c) algorithm, then 
with probability at least 0.7 

\\A-CC + A\\ F <{l + e)\\A-A k \\ F (14) 

Similarly, if we set c = 0(klogk/e 2 ) and run Algorithm 1 by choosing no more than c columns 
in expectation from A with the Expected(c) algorithm, then (14) holds with probability at least 
0.7. 

Proof: Since for every set of columns C = ASqDc, X op t = C + A is the matrix that minimizes 
\\A — CX\\ F , it follows that 

\\A-CC+A\\ F = \\A-(AScDc)(AS c Dc) + A\\ F 

< \\A-{AScDc)(P A ,kAScDc) + PA,kA\\ F , (15) 

where Pa,u = U A,kfJ\ k is a projection onto the top k left singular vectors of A. To bound (15), 
consider the problem of approximating the solution to minj eRm xm || XA k — A\\ F by randomly 
sampling columns of A k and of A. It follows as a corollary of (21) of Theorem 5 of Section 6 that 

\\A - {AS c Dc){A k S c Dc) + A k \\ F < (1 + e) \\A - AA+A k \\ p = (1 + e) \\A - A k \\ F , (16) 

which, when combined with (15), establishes the theorem. 

o 

Remark: For simplicity of presentation, we have presented Algorithm 1 and Theorem 3 such 
that (14) holds with only constant probability, but this can be boosted to hold with probability at 
least 1 — 5 using standard methods. In particular, consider the following: run Algorithm 1 (using 
either the Exactly(c) algorithm or the Expected(c) algorithm, but with the appropriate value 
of c) independently ln(l/5) times; and return the C such that ||^4 — CC + A|| F is smallest. Then, 
since in each trial the claim of Theorem 3 fails with probability less than 0.3 < 1/e, the claim 
of Theorem 3 will fail for every trial with probability less than (l/e) ln ( 1 / <5 ^ = 5. This establishes 
Theorem 1. 

Remark: For simplicity of presentation, we have also stated Theorem 3 in such a way that 
the rank of the approximating matrix A' = CC + A may be greater than k. This possibility may 
be undesirable in certain applications, and it can be easily removed. Let A = C (PA, k C) + PA tk A. 
Then, it follows from (16) that A is a CX matrix approximation that is within relative error e 
of the best rank-fc approximation to A and that has rank no more than k. 
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4.3 Discussion of the Analysis 



Given a matrix A, Theorem 1 asks us to find a set of columns C = AScP>c such that CC + A 
"captures" almost as much of A as does Ak = UA,kU\ k A. Given that set (or any other set) of 
columns C, it is well-known that the matrix X op t = C + A is the "smallest" matrix among those 
that solve the optimization problem (19). For a given A and C, let us approximate X opt as 

X opt = C + A « (P A , k C) + P Atk A. 

This approximation is suboptimal with respect to solving the optimization problem (19), i.e., 

\\A - CC + A\\ p <\\A-C {P A , k C) + P A , k A\\ F , 

but it can be shown that by choosing C properly, i.e., by choosing Sc and Dc (the column 
sampling and rescaling matrices) properly, we have that 

\\A - C (P Ak C) + P A , k A\\ p < (1 + e) \\A - A k \\ F . 

The main technical challenge is to sample in a manner such that the column-sampled version of 
the matrix consisting of the top k right singular vectors of A is full rank, i.e., rank(Vjf k ScDc) = 
rank(Vj fe ) = k. To accomplish this, we sample with respect to probabilities of the form (13). To 
understand these sampling probabilities, recall that we seek to pick columns that span almost the 
same subspace as the top k left singular vectors of A (i.e., U k ), and recall that the i-th column 
of A is equal to 

AW = t4£ fc (V k T f + U p _ k Z p _ k {Vj_ k f . 

n 2 

Since post-multiplying U k by T, k does not change the span of the columns of U k , (V k T ) 
measures "how much" of the i-th column of A lies in the span of U Ajk , independent of the 
magnitude of the singular values associated with those directions. 



5 Our Main Column-Row-Based Matrix Approximation Algo- 
rithm 

In this section, we describe an algorithm and a theorem that, when combined with the results of 
Section 4, will establish our second main result, Theorem 2. 



5.1 Description of the Algorithm 

Algorithm 2 takes as input an m x n matrix A, an m x c matrix C consisting of a small number 
of columns of A, and an error parameter e. It returns as output anrxn matrix R consisting of 
a small number of rows of A and an r x c matrix W consisting of the corresponding rows of C. 
The algorithm is very simple: sample a small number of rows according to a carefully-constructed 
nonuniform probability distribution. Algorithm 2 uses the sampling probabilities 



Pi 



Vi G [m], 



(17) 



but it will be clear from the analysis of Section 6 that any sampling probabilities Pi,i £ [m], such 
that pi > (5 {U£) w /c, for some P £ (0, 1], will also work with a small /3-dependent loss in 
accuracy. Note that Algorithm 2 actually consists of two related algorithms, depending on how 
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exactly the rows are chosen. The Exactly (c) algorithm picks exactly r rows of A to be included 
in R in r i.i.d. trials, where in each trial the i-th row of A is picked with probability p{. The 
Expected (c) algorithm picks in expectation at most r rows of A to create R, by including the 
i-th column of A in C with probability min{l,rpj}. See Algorithms 4 and 5 in Appendix A for 
more details about these two row-sampling procedures. 



Data : A G ]R mxra , C G M mxc consisting of c columns of A, a positive integer r, and an 
error parameter e. 

Result : R G W xn consisting of r rows of A and W G M cxr consisting of the corresponding 
r rows of C, and U G M rxc . 

• Compute sampling probabilities for all z G [m] given by (17); 

• (Implicitly) construct a sampling matrix S R and a diagonal rescaling matrix Z)^ 
with the Exactly(c) algorithm or with the Expected(c) algorithm; 

• Construct and return the matrix R = D R S^A consisting of a small number of 
rescaled rows of A; 

• Construct and return the matrix W = D R S^C consisting of the corresponding 
rescaled rows of C; 

• Let U = W + ; 



Algorithm 2: A randomized algorithm for CUR matrix decomposition. 

Reading the input matrices to Algorithm 2 takes 0(mn) time; computing the full SVD of C 
requires 0(c 2 m) time; constructing the matrix R requires 0(rn) time; constructing the matrix 
W requires 0(rc) time; and computing U requires 0(c 2 r) time. Overall, the running time of the 
algorithm is 0(mn) since c, r are constants independent of m, n. This can be improved if the 
input matrices are sparse, but for simplicity we omit this discussion. 

5.2 Statement of the Theorem 

Theorem 4 is our main quality-of- approximation result for Algorithm 2. 

Theorem 4 Let A G M. mxn , let C G ]R mxc be a matrix consisting of any c columns of A, and let 
e G (0, 1]. If we set r = 3200c 2 /e 2 and run Algorithm 2 by choosing r rows exactly from A and 
from C with the Exactly(c) algorithm, then with probability at least 0.7 

\\A-CUR\\ F <(l + e)\\A-CC + A\\ F (18) 

Similarly, if we set r = 0(c log c/e 2 ) and run Algorithm 2 by choosing no more than r rows in 
expectation from A and from C with the Expected(c) algorithm, then (18) holds with probability 
at least 0.7. 

Proof: Consider the problem of approximating the solution to mm X £M. cxn \\CX — A\\ F by ran- 
domly sampling rows from C and A. It follows as a corollary of (21) of Theorem 5 of Section 6 
that 

\\A - C{D R S T R C) + D R S T R A\\ F < (1 + e) || A - CC + A\\ p , 
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where R = DrS^A and U = (DrS]^C) + , which establishes the theorem. 

o 

Remark: For simplicity of presentation, we have presented Algorithm 2 and Theorem 4 such 
that (18) holds with only constant probability, but this can be boosted to hold with probability 
at least 1 — 5 using standard methods. In addition, this can be combined with Algorithm 1 and 
Theorem 3 by doing the following: run Algorithm 1 ln(2/<5) times, and return the best C; then, 
with that C run Algorithm 2 ln(2/<5) times, and return the best U, R pair. Then 

\\A - CUR\\ F < (1 + e) \\A - CC + A\\ p < (1 + e) 2 \\A - A k \\ F < (1 + e') \\A - A k \\ F , 

where e = 3e, and the combined failure probability is no more than 6/2+5/2 = 5. This establishes 
Theorem 2. 



5.3 Discussion of the Analysis 

Assume that we are given an m x c matrix C, consisting of any set of c columns of an m x n 
matrix A, and consider the following idea for approximating the matrix A. The columns of C are 
a set of "basis vectors" that are, in general, neither orthogonal nor normal. To express all the 
columns of A as linear combinations of the columns of C, we can solve 



mm 



A® - Cxj 



for each column A^\j G [n], in order to find a c- vector of coefficients Xj and get the optimal 
least-squares fit for A^\ Equivalently, we can solve an optimization problem of the form (19). 
Note that if m and n are large and c = 0(1), then this is an overconstrained least-squares fit 
problem. It is well-known that X op t = C + A is the "smallest" matrix solving this optimization 
problem, in which case we are using information from every row of A to compute the optimal 
coefficient matrix. Let us approximate X opt as 

X op t = C + A rs (D r SrC) + DrS^A = X opt , 

and note that X opt = W + R. This matrix X op t is clearly suboptimal with respect to solving the 
optimization problem (19), i.e., 

\\A-CC + A\\ p < ||A-CW+R|| F , 

but it can be shown that by choosing Sr and Dr (the row sampling and rescaling matrices) 
properly we have that 

\\A - CW + R\\ p < (1 + e) \\A - CC + A\\ p . 

As in Section 4.3, the main technical challenge is to sample in a manner such that the row- 
sampled version of the matrix consisting of the top c left singular vectors of C is full rank, i.e., 
iimk{D R S]/ i Uc,c) = rank(^7 CjC ) = c. 



6 An Approximation Algorithm for Generalized Regression 

The basic linear-algebraic problem of £2 regression is one of the most fundamental regression 
problems, and it has found many applications in mathematics and statistical data analysis. Recall 
the standard £2 regression (or least-squares fit) problem: given as input a matrix A £ flj mxn an d 
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a target vector b G M m , compute Z = min^gRn \b — Ax\ 2 . Also of interest is the computation of 
vectors that achieve the minimum Z. If m > n there are more constraints than variables and the 
problem is an overconstrained least-squares fit problem; in this case, there does not in general 
exist a vector x such that Ax = b. It is well-known that the minimum-length vector among those 
minimizing \b — Ax\ 2 is x opt = A + b. We previously presented an elaborate sampling algorithm 
that represents the matrix A by a matrix by a small number of rows so that this £2 regression 
problem can be solved to accuracy lie for any e > [29]. 

This problem is of interest for CX and CUR matrix decomposition for the following reason. 
Given a matrix A and a set of its columns C, if we want to get the best fit for every column of 
A in terms of that basis, we want to solve CX ~ A for the matrix X. More precisely, we would 
like to solve the optimization problem such as 

Z= min \\A-CX\\ F . (19) 

It is well-known that the matrix X = C + A is the "smallest" matrix among those that solve this 
problem. In this case, we are approximating the matrix A as A' = CC + A = PqA, and by keeping 
only the columns C we are incurring an error of \\A — CC + A\\ F . Two questions arise: 

• First, how do we choose the columns C such that \\A — CC + A\\ F is within relative error e 
of||A-A fc || F ? 

• Second, how do we choose the rows R and a matrix U such that \\A — CUR\\ F is within 
relative error e of \\A — CC + A\\ F ? 

Motivated by these observations, we will consider the generalized version of the standard £2 
regression problem, as defined in (5) and (6). 

In this section, we first present Algorithm 3, which is our main random sampling algorithm 
for approximating the solution to the generalized £2 regression problem, and Theorem 5, which 
provides our main quality-of-approximation bound for Algorithm 3. Then, we discuss the novel 
nonuniform "subspace sampling" probabilities used by the algorithm. Finally, we present the 
proof of Theorem 5. 

6.1 Description of the Algorithm and Theorem 

Algorithm 3 takes as input an m x n matrix A with rank no greater than k, an m x p matrix 
B, a set of sampling probabilities {j>i}^=i> an d a positive integer r < m. It returns as output a 
number Z and a nx p matrix X op t. Using the sampling matrix formalism described in Section 2, 
the algorithm (implicitly) forms a sampling matrix S, the transpose of which samples a few rows 
of A and the corresponding rows of B, and a rescaling matrix D, which is a matrix scaling the 
sampled rows of A and B. Since r rows of A and the corresponding r rows of B are sampled, the 
algorithm randomly samples r of the m constraints in the original £2 regression problem. Thus, 
the algorithm approximates the solution of the regression problem AX ~ B, as formalized in 
(19) and (5), with the exact solution of the downsampled regression problem DS T AX « DS T B. 
Note that it is the space of constraints that is sampled and that the dimensions of the unknown 
matrix X are the same in both problems. Note also that although both m and n are permitted 
to be large, the problem is effectively overconstrained since rank(A) < k. As we will see below, 
r = O(klogk) or r = 0(k 2 ), depending on exactly how the random sample is constructed. Thus, 
we will compute the solution to the sampled problem exactly. 

Theorem 5 is our main quality-of-approximation result for Algorithm 3. Its proof may be 
found in Section 6.3. Recall that for our generalized £2 regression problem, the matrix A has rank 
no greater than k. 
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Data : A E W nxn that has rank no greater than k, B E M mxp , sampling probabilities 

{Pi\iL\i an d r <m. 
Result : A op i E M" xp , Z£l. 

• (Implicitly) construct a sampling matrix £ and a diagonal rescaling matrix D with 
the Exactly (c) algorithm or with the Expected (c) algorithm; 

• Construct the matrix DS A consisting of a small number of rescaled rows of A; 

• Construct the matrix DS T B consisting of a small number of rescaled rows of B; 

• X opt = (DS T A) + DS T B; 



• Z 



DS T B - DS T AX, 



opt 



Algorithm 3: A Monte-Carlo algorithm for approximating £2 regression. 



Theorem 5 Suppose A E M. mxn has rank no greater than k, B E W mxp , e E (0,1], and let 
Z = min^gjjnxp \\B — AX\\ F = \\B — AX opt \\ F , where X opt = A + B = A^B. Run Algorithm 3 



with any sampling probabilities of the form 

|2 



Pi>P- 



( U Ak)(i)\ 



YTj=\ ( u A,k 



k 



(U A ,k 



(0 



Vi E [n], 



(20) 



/or some (3 E (0, 1], and assume that the output of the algorithm is a number Z and an n x p 
matrix X op t. If exactly r = 3200fc 2 //3e 2 rows are chosen with the Exactly(c) algorithm, then 
with probability at least 0.7: 



B- AX, 



opt 



X pt X op i 



< 



0"min (A k ) 



(21) 
(22) 



//, in addition, we assume that 
probability at least 0.7: 



UA,kU^ k B > j\\B\\ F , for some fixed 7 E (0,1], then with 



X Q pt X op t 



< e 



K{A k )\fr 



1 \\X, 



opt 1 1 p ■ 



(23) 



Similarly, under the same assumptions, if r = 0(klogk/f3e 2 ) rows are chosen in expectation with 
the Expected(c) algorithm, then with probability at least 0.7, (21), (22), and (23) hold. 

Equation (21) states that if the matrix of minimumdength vectors achieving the minimum in 
the sampled problem is substituted back into the residual norm for the original problem then a 
good approximation to the original £2 regression problem is obtained. Equation (22) provides a 



bound for 



X Q pt X op t 



in terms of o~ m i n (Ak) and Z. If most of the "weight" of B lies in the 



complement of the column space of A = A k then this will provide a very poor approximation in 
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terms of ||X op t|| F . However, if we also assume that a constant fraction of the "weight" of B lies 
in the subspace spanned by the columns of A, then we obtain the relative-error approximation 



X, 



opt 



x, 



opt 



if Au is well- 



of Equation (23). Thus, Theorem 5 returns a good bound for 

conditioned and if B lies "reasonably well" in the column space of A. Note that if the matrix 
of target vectors B lies completely within the column space of A, then Z = and 7 = 1. In 
this case, Theorem 5 shows that Algorithm 3 returns Z and x op t that are exact solutions of the 
original £2 regression problem, independent of n{Ak). Finally, note that in our analysis of CX 
and CUR matrix decompositions we only use the result (21) from Theorem 5, but (22) and (23) 
are included for completeness. 



6.2 Discussion of the Method of "Subspace Sampling" 

An important aspect of Algorithm 3 is the nonuniform sampling probabilities (20) used by the 
Exactly (c) algorithm and the Expected (c) algorithm in the construction of the induced sub- 
problem. We call sampling probabilities satisfying condition (20) "subspace sampling" probabili- 
ties. Condition (20) states that the sampling probabilities should be close to, or rather not much 
less than, the lengths, i.e., the Euclidean norms, of the rows of the left singular vectors of the 
matrix A = A^. (Recall that in this section A is an m x n matrix with rank no more than k, and 
thus Uau is an m x matrix. Thus, the Euclidean norm of every column of Uau equals 1, but 
the Euclidean norm of every row of Ua,u is in general not equal and is only bounded above by 
1.) Sampling probabilities of the form (20) should be contrasted with sampling probabilities that 
depend on the Euclidean norms of the columns or rows of A and that have received much atten- 
tion recently [33, 34, 21, 22, 23, 26]. Since A = Ua^aVai sampling probabilities with this latter 
form depend in a complicated manner on a mixture of subspace information (as found in U a an d 
Va) and "size-of-A" information (as found in S^). This convolution of information may account 
for their ability to capture coarse statistics such as approximating matrix multiplication or com- 
puting low-rank matrix approximations to additive error, but it also accounts for their difficulty 
in dealing with problems such as £2 regression or computing low-rank matrix approximations to 
relative error. 

Since the solution of the £2 regression problem involves the computation of a pseudoinverse, 
the problem is not well-conditioned with respect to a perturbation (such as that introduced by 
sampling) that entails a change in dimensionality, even if (actually, especially if) that change in 
dimensionality corresponds to a small singular value. Since sampling probabilities satisfying (20) 
allow us to disentangle subspace information and "size-of-j4" information, we will see that they 
will allow us to capture (with high probability) the entire subspace of interest by sampling. More 
precisely, as we will see in Lemma 1, by using sampling probabilities that satisfy condition (20) 
and by choosing r appropriately, it will follow that 

vank{D S T U A,k) = rank(L r J 4 i fc) = k. 

Thus, the lengths of the Euclidean norms of the rows of Ua,h m ay be interpreted as capturing 
a notion of information dispersal by the matrix A since they indicate to which part of the m- 
dimensional vector space the singular value information of A is being dispersed. In this case, 
condition (20) ensures that the sampling probabilities provide a bias toward the part of the high- 
dimensional constraint space to which A disperses its singular value information. Then, having 
constructed the sample, we will go to the low-dimensional, i.e., the r-dimensional rather than 
then m-dimensional space, and approximate the I2 regression problem by doing computations 
that involve "size-of-vl" information on the random sample. 

This method of "subspace sampling" was first used in a preliminary version of the £2 regression 
results of this section [29]. Note that an immediate generalization of the results of [29] to the 
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generalized £2 regression problem considered in this section would involve sampling probabilities 
of the form 



Pi 



(?) 



+ 



(1/3) 


( U A,k) {i) 


2 






(UA,k) (j) 


( jj± jj± T g\ 

2 \ A,k A,k J ■ 



+ 



(1/3) (Ubpjt/B 



(24) 



rather than of the form (20). Since the second and third terms in (24) provide a bias toward 
the part of the complement of the column space of A = where B has significant weight, we 
directly obtain variance reduction. Thus, by using probabilities of the form (24) we can sample 
0(k 2 log(l/<5)/e 2 ) columns and directly obtain the claims of Theorem 5 with probability at least 
1 — 5. Although sampling probabilities of the form (20) are substantially simpler, we obtain 
variance control indirectly. We first establish that each of the claims of Theorem 5 holds with 
constant probability, and we then can show that each of the claims holds with probability at least 
1 — 5 by running 0(log(l/<5)) trials and using standard boosting procedures. 



6.3 Proof of Theorem 5 

In this section we provide a proof of Theorem 5. We will first prove (21), (22), and (23) under 
the assumption that the rows of A and B are sampled with the Exactly (c) algorithm. Then, 
in Section 6.3.5, we will outline modifications to the proof if the rows of A and B are sampled 
with the Expected(c) algorithm. For simplicity of notation in this section, we will let S = DS T 
denote the r x m rescaled row-sampling matrix. Let the rank of the m x n matrix A be p < k, 
and let its SVD be 

A = U A X A Vj, 

where U A £ R nxp , T, A £ R pxp , and V A G R dxp . In addition, let the rank of the r x p matrix 
SU A = DS T U A be p, and let its SVD be 

SU A = U S u A ^su A Vsu A , 

where U S u A € W xp , Y, S u A € W xp , and V SUa £ W xp . Recall that p<p<k<r. 

In order to illustrate the essential difficulty in constructing a sampling algorithm to approxi- 
mate the solution of the generalized £2 regression problem, consider inserting X opt = (SAk) + SB 
into B - A k X: 

B — AkX opt = B — Ak (SAk) + SB 

= B - UA,k^A,kVA,k {SUA,k^A,kYA,kY S B 
= B — UA,k^A,k {SUA,k^A,k) + SB 

= B - U A ,k^A,k (Usu A , k ^su A , k Vs UA ^A,kj SB 

= B - U A ,k^A,k (psuA,k V su Aik EAfi) U^ UA k SB. 

To proceed further, we must deal with the pseudoinverse, which is not well-behaved with respect 
to perturbations that involve a change in dimensionality. To deal with this, we will focus on 
probabilities that depend on the subspace that we are downsampling, i.e., that depend on Ua,Hi 
in order to guarantee that we capture the full subspace of interest. 
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6.3.1 Several lemmas of general interest 



In this subsection, we will present three lemmas of general interest. Then, in the next subsections, 
we will use these lemmas to prove each of the claims of Theorem 5. 

Since the m x k matrix Ua h is a matrix with orthogonal columns, several properties hold for 
it. For example, rank(C/^ ; fc) = k, U + = U T , and A^ = V^E^f/J fc . Although the r x k matrix 
SUA,k is does not have orthogonal columns, the following lemma characterizes the manner in 
which each of these three properties holds, either exactly or approximately. For the first lemma, 
r depends quadratically on k. 

Lemma 1 Let e G (0,1], and define = (SUA,k) + — {SUA,k) T ■ If the sampling probabilities 



satisfy equation (20) and if r > 400/c 2 / (3e 2 , then with probability at least 0.9: 

p = p, i.e., rank(SU A,k) = rankiJJ A,k) = rank(A k ) 



Nl 2 

(SA k ) 



J su A 



yA,k^~A,k (SUA,k 



- e; 



< e/V2 



Proof: To prove the first claim, note that for all i £ [p] 



(25) 
(26) 
(27) 
(28) 



\l-^(SU A , k )\ = \o- t (Ul k U A ,k)-o- t {Ul k S T SU A ,k)\ 

< \\ U A,k U A,k ~ Ua^S T SU A ,k\\ 2 

< \\ U A,k U A,k ~ UA^S T SU A ,k\\ F ■ 



(29) 
(30) 



Note that (29) follows from Corollary 8.1.6 of [37], and (30) follows since ||-|| 2 < \\-\\ F - To bound 
the error of approximating LJ\ k U A,k by LJ\ k S T SU A,k we apply Theorem 6 of Appendix A. Since 
the sampling probabilities pi satisfy equation (20), it follows from Theorem 6 and by applying 
Markov's inequality that with probability at least 0.9: 



\ U A,k U A,k ~ UA )k S T SUA,k\ 



F ^ 



< 



10 E 
10 

~7W 



\ U A,k U A,k - Ul M S T SU A ,k\\ F 



Wam\ 



F ' 



(31) 



where E [•] denotes the expectation operator. By combining (30) and (31), recalling that ||£7a,A;||^ = 
p < k, and using the assumed choice of r, it follows that 

\l - a* (SU A , k )\ <e/2< 1/2 

since e < 1. This implies that all singular values of SUA,k are strictly positive, and thus that 
rank(5t r ^4 j fc) = r&nk(UA,k) = ran k(^4fc), which establishes the first claim. 
To prove the second claim, we use the SVD of SUa k and note that 



inn 



(SU A ,k) + - (SU A ,k) 



(UsUa^SUa^VJu^ ~ (UsUa^SUa^VsUa^ 
v su A , k (psu Aik - S «S^A, fc ) U SU A „ 
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The claim follows since Vsu A k and Usu A k are matrices with orthonormal columns. 
To prove the third claim, note that 



(SA k ) + = (SU A ,k^A,kVl k ) + 

= {ysu A ^su Atk Vsu A:k ^A,kVj 
= V A ,k (r>su A , k Vsu A ^A,k) u£ Ua 



rT 
'AM 



(32) 



To remove the pseudoinverse in the above derivations, notice that since p = p with probability 
at least 0.9, all three matrices ^su A fcJ Vsu A k i an< ^ ^A,k are full rank square p x p matrices, and 
thus are invertible. In this case, 

(^SU A , k Vsu A ^A,k) = (^SU A:k Vsu A ^A,k) 

= ^Ak y su A , k ^su A y ( 33 ) 

By combining (32) and (33) we have that 

{SA k ) + = V Atk ^ k V S u A , k ^su A;k u su A , k 
= V A ,k^ k (SU Ak ) + , 

which establishes the third claim. 2 

Finally, to prove the fourth claim, recall that under the assumptions of the lemma p = p with 
probability at least 0.9, and thus oi (SU A>k ) > for all i € [p\. Thus, 

1 



J su Atk 



max 



max 



Ui {SU Ak 



Gj (SU Ak ) 



\<Jj {SUA,k) a j (SU A ,k) - 1| 

\(7j (SU A ,k)\ 



a] (SU Ak ) - 1 
< max — j — — - — — . 

je[ P ] \<Tj{SU A ,k)\ 

Using that fact that, by (29), for all i € [p], 

\l-af (SU A , k )\ < \\ul k U Ak - Ul k S T SU Atk \\ 2 , 
it follows that for all i € [p] 

1 1 



(34) 



Ui (SU A>k ) 



< 



1 - 



ul k u Ak -ul k s?su Ak 



A,k^ 



When these are combined with (34) it follows that 

Ul k UA,k - ul k s T su A>k 



^SU, k - £ 



su Atk 



< 



1 



2 One might be tempted to suggest that the proof of this third claim should be "simplified" by appealing to the 
result that the generalized inverse of the product of two matrices equals the product of the generalized inverse of 
those matrices. This result is, of course, false — see, e.g., Section 3.1.1 of [60]-and so we need a more refined analysis 
such as the one presented here. 
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Combining this with the Frobenius norm bound of (31), and noticing that our choice for r 



guarantees that 1 



U~AkUA,k — U^ k S T SUA,k > 1/2, concludes the proof of the fourth claim. 



This concludes the proof of the lemma. 



The next lemma provides an approximate matrix multiplication bound that is useful in the 
proof of Theorem 5. For this lemma, r depends linearly on k. 

Lemma 2 Let e G (0, 1]. If the sampling probabilities satisfy equation (20) and if r > 400/c//3e 2 , 
then with probability at least 0.9: 



U I h S T SU auUa u B 



< 6 - 
F 2 



U A,k U A,k B 



F 
T 



Proof: First, note that since U am is an orthogonal matrix and since U A k U Ak = , we have that 



U A,kUA,kS T S U A,k U A,k B 



Since 



u A , k u T 



AM 



(0 



UIm 



(0 



UAMUlkUikUik B-U A MUlkS T SUi k Ui k B (35) 

r 



the sampling probabilities (20) satisfy (45), where (45) 



will appear in Appendix A. 2, and thus are appropriate for bounding the right hand side of (35). 
Thus, it follows from Markov's inequality and Theorem 6 that with probability at least 0.9: 



ul k s T sui k ui k B 



10 E 



ul k s T sui k ui k B 



< 



10 >\TT 1J T I 
\ U AM U AM\ 



The lemma follows by the choice of r and since 



UamUT 



u am U am B 



The final lemma of this subsection relates the norm of the m x p matrix Uj[ k Uj[ k B to the 

norm of the r x p matrix SU^ k U Ak B, i.e., the row sampled and rescaled version of the original 
m x p matrix. For this lemma, r is independent of k. 



Lemma 3 With probability at least 0.9: 



T . 



s U Ak U Ak B 



< 10 



U AM U AM B 



Proof: Let Q = U Ak U Ak B, and let ji,j2, ■ ■ ■ ,jr t> e the r rows of Q that were included in 
SQ = DS T Q. Clearly, 



E 



\DS T Q\\ 2 p 



E 



E \Q<*: 



t=i 



" 1 2 



r n I ^-j 1 2 



t=i 



t=l j=l 



rpj 



where the penultimate equality follows by evaluating the expectation. The lemma follows by 
applying Markov's inequality and taking the square root of both sides of the resulting inequality. 

o 
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6.3.2 Proof of Equation (21) 

In this subsection, we will bound B — A k X opt , thus proving (21). For the moment, let us assume 
that r = 400fc 2 //3e 2 , in which case the assumption on r is satisfied for each of Lemma 1, Lemma 2, 
and Lemma 3. Thus, the claims of all three lemmas hold simultaneously with probability at least 
1 — 3(0.1) > 0.7, and so let us condition on this event. 
First, we have that 



B-A k X opt = B-A k {SA k ) + SB 

= B-U Ak (SU Ak ) + SB 



(37) 



B - U Ak (SU Ak ) + SU Atk U^ k B - U Atk (SU Atk ) + SUl k Ul k B (38) 
Ul k Ul k T B - U A , k (SU Atk ) + SUl k Ul k T B. (39) 



(37) follows from (27) of Lemma 1, (38) follows by inserting U A)k U\ k + U^ k U^ k = I n , and 
(39) follows since (SU Atk ) + SU Atk = I p by Lemma 1. We emphasize that (SU Atk ) + SU Atk = 
Vsu A k Vsu A k = Ip d° es n °t n °ld f° r general sampling methods, but it does hold in this case since 
p = p, which follows from Lemma 1. 

By taking the Frobenius norm of both sides of (39), by using the triangle inequality, and 
recalling that Q = (SU Ajk ) + — (SU Ajk ) T ', we have that 



B — A k X opt 



< 



< 



U A,k U A,k 7 B 


+ 

F 


U A)k U A ^ k B 


+ 

F 



U A , k (SU Ak fSUi k Ui k T B 



+ 



u A±S T SU Ak U Ak B 



F 

+ \M\ 2 



u Ak nsui k ui k i 



S U A,k U A,k B 



where (40) follows by submultiplicativity and since U A)k has orthogonal columns. By combining 
(40) with the bounds provided by Lemma 1 through Lemma 3, it follows that 



B — A k X opt 



< (1 + e/2 + We/V2)Z 



< (l + 8e)Z. 

Equation (21) follows by setting e' = e/8 and using the value of r assumed by the theorem. 
6.3.3 Proof of Equation (22) 

in terms of Z, thus proving 
rich case the assumption on r is 



X, 



opt 



X, 



opt 



In this subsection, we will provide a bound for 

(22). For the moment, let us assume that r = 400fc 2 //3e 2 , in w': 
satisfied for each of Lemma 1, Lemma 2, and Lemma 3. Thus, the claims of all three lemmas hold 
simultaneously with probability at least 1 — 3(0.1) > 0.7, and so let us condition on this event. 

Since U A ^U\ k + l' A j .l' Ak = I n and (SU Aik ) + SU A:k = I p , we have that 
X Q pt — X, 



*-opt 



= A+B-{SA k ) + SB 

= VA^A\ U lk B ~ V A ,k^k ( SU A,k) + SB 

= V^k^^U^B - V A}k YT^ k (SU A<k ) + SU Atk U A k B - V Atk YT^ k (SU A<k ) + SU A)k U A , 

= -V A ,k^A% ( su A,k) + SU A k U A k T B. 
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Thus, it follows that 



X Q pt X Q pt 



V A ^ A \{SU A , k ) + SUi k Ui k B 

^ k ((su Ak f + n)sui k ui k J 



D 



< 
< 



0~min(.A k ) 
1 



(SU A , k ) T SU^U^ B 



r - 



U A hS T SU Ak U Ak B 



F 0~min{A k ) 
1 



toSUi k Ui k B 



+ 



F 0~min(A k ) 



SU Ak U Ak B 



;4i) 

F 



By combining (41) with Lemmas 1, 2, and 3, it follows that 



X pt X Q pt 



< <in(^)(e/2 + 10e/V2)z 
8e 



< 



-Z. 



"'minimi A;) 

Equation (22) follows by setting e' = e/8 and using the value of r assumed by the theorem. 
6.3.4 Proof of Equation (23) 

The error bound provided by (22) could be quite weak, since min X6K nx P \\ B — A k X\\ F could be 
quite close or even equal to ||-B|| F , if B has most or all of its "weight" outside of the column 

space of A k . Under a slightly stronger assumption, we will provide a bound X opt — X opt in 

terms of ||Xop$||^,, thus proving (23). 

If we make the additional assumption that a constant fraction of the "weight" of B lies in the 
subspace spanned by the columns of A k , then it follows that 



min \\B — A k X\\ F 



In order to relate 



U A,k U A,k B 



- \\ B \\f ~ \\ UA ^ U A,k B] ^F 
< ( 7 - 2 -l)||^ fc [/X fcJ B|| 2 p 



U Ajk U A k B and thus Z to ||X pi|| F note that 



(42) 



opt 1 1 p 



V A,k^ A \ U A,k B 



^A^Ak 3 



> a mill (Y IA ^ k )\\U Ak B\\ F 
u A,kU Ayk B 

0~ma,x(A k ) 



(43) 
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By combining (22) with (42) and (43), we get 



X pt X Q pt 



which establishes (23). 



< 
< 



F ~" cr min (A fc ) 
e 



""min(^4fc / 



< e 



(A k ) 



V7- 2 - 1 II*. 



opt \\p J 



6.3.5 Modifications to the proof with alternate row sampling procedure 

If, in Algorithm 3, the rows are sampled with the Expected (c) algorithm, then the proof of the 
claims of Theorem 5 is analogous to the proof described in the four previous subsections, with 
the following major exception. The claims of Lemma 1 hold if r = 0(k log &//3e 2 ) rows are chosen 
with the Expected(c) algorithm. To see this, recall that to bound the first claim of Lemma 1, we 



must bound the spectral norm 



in (29). If the sampling is performed 



M A , k -U A ^SU A , k 

with the Exactly(c) algorithm, then this is bounded in (30) by the corresponding Frobenius 
norm, which is then bounded with Theorem 6. On the other hand, if the sampling is performed 
with the Expected (c) algorithm, then we can bound (29) directly with the spectral norm bound 
provided by Theorem 7. 

Since the remaining claims of Lemma 1 follow from the first, they are also valid if r = 
0{k\ogk/ f5e 2 ) rows are chosen with the Expected(c) algorithm. Lemma 2 still follows if 
r = 400/c//3e 2 , by using the Frobenius norm bound of Theorem 7, and Lemma 3 also follows 
immediately. The proofs of (21), (22), and (23) are identical, and thus Theorem 5, under the 
assumption that the the rows are chosen with the Expected (c) algorithm, follows. 



7 Empirical Evaluation 

Although this is a theoretical paper, it is motivated by applications, and thus one might wonder 
about the empirical applicability of our methods. For example, if we want to do as well as the 
best rank k = 100 (respectively, k = 10) approximation, with relative error bound e = 0.1, then 
our main theorem samples 3.2 billion (respectively 32 million) columns of the matrix A using the 
Exactly(c) algorithm. Of course, our main theorem states that it in order to obtain our strong 
provable worst-case relative-error guarantees it suffices to choose that many columns. But, it 
would be a source of concern if anything like that number of columns is needed in "real" scientific 
and internet data applications. 

In this section, we provide an empirical evaluation of the performance of our two main sampling 
procedures both for CX and CUR decompositions. In particular, we will evaluate how well the 
proposed column/row selection strategies perform at capturing the Frobenius norm for matrices 
derived from DNA Single Nucleotide Polymorphism (SNP) analysis, recommendation system 
analysis, and term-document analysis. By applying our algorithms to data sets drawn from these 
three diverse domains of modern data analysis, we will demonstrate that we can obtain very good 
Frobenius norm reconstruction by sampling a number of columns and/or rows that equals a small 
constant, e.g., 2 or 3 or 4 (as opposed to, e.g., a million or a billion), times the rank parameter k. 
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7.1 Details of Our Empirical Evaluation 

The empirical evaluation of our CX and CUR matrix decompositions has been performed using 
the following two types of column/row selection methods: 

• "Subspace sampling" (with replacement) using the Exactly (c) algorithm; and 

• "Subspace sampling" (without replacement) using the Expected(c) algorithm. 

In addition, the empirical evaluation has been performed on the following three data sets: 

• Matrices derived from the DNA SNP HapMap data [15, 52]-see Section 7.2; 

• A matrix derived from the Jester recommendation system corpus [35, 48] -see Section 7.3; 
and 

• A matrix derived from the Reuters term-document corpus [46, 16]-see Section 7.4. 

We have chosen these three data sets on which to evaluate the empirical applicability of our algo- 
rithms for three reasons: first, these three application domains are representative of a wide range 
of areas of modern scientific and internet data analysis; second, these matrices are all approxi- 
mately (to a greater or lesser extent) low-rank, and they are all data for which spectral methods 
such as low-rank approximations have been successfully applied; and third, we have already (with 
collaboraters from these application areas) applied our algorithms to these data sets [48, 52, 16]. 
In these data application papers [48, 52, 16], we have shown that our main CX and CUR decom- 
position algorithms (either the algorithms for which we have provable performance guarantees 
and/or greedy variants of these basic algorithms) perform well on tasks such as classification, 
denoising, reconstruction, prediction, and clustering — tasks that are of more immediate interest 
to data practitioners than simply capturing the norm of the data matrix. 

In this section, however, we we will restrict ourselves to an empirical evaluation of our two 
main theorems. To do so, we will fix a rank parameter k, and we will present plots of the 
Frobenius norm error (normalized by ||^4 — A^Hf), as a function of the number of samples chosen. 
For example, we will consider 0i = \\A — CC + A\\p/\\A — A^Wf, where A^ is the best rank-A; 
approximation to the matrix A, as a function of the number c of columns chosen. This ratio 
corresponds to the quantity that is bounded by 1 + e in Theorem 3. For c = k, this quantity 
will be no less than 1; of course, if we choose c > k columns then this ratio may be less than 1. 
Following the remark after Theorem 3, we will also consider ©2 = \\A — CC + A^Wf /\\A — \p- 
This ensures that the approximation has rank no greater than k (which is of interest in certain 
applications), and thus the plotted ratio will clearly be no less than 1, for every value of c. We 
will also consider 03 = \\A — CUR\\f/\\A — A^Wf, which corresponds to the quantity that is 
bounded by 1 + e in Theorem 4. 

Two technical points should be noted about these plots in the upcoming subsections. First, 
we ran our CX or CUR decomposition algorithm several — e.g., three or five, depending on the 
size of the data being plotted — times (corresponding, say, to multiple runs to boost the 5 failure 
probability) and the minimum value over these repetitions was returned; this was repeated several 
times and the average of those values is plotted. Second, for the plots of \ \A — CUR\\f/\\A — A^Wf, 
the number of rows selected is set to be twice the corresponding number of columns selected; 
optimizing over this would lead to marginally better performance than that presented. 

7.2 DNA SNP HapMap Data 

Our first dataset comes from the field of human genetics. The HapMap project, a continuation 
of the Human Genome project, aims to map the loci in the human genome that differ between 



25 



individuals [15]. The HapMap project focuses on the so-called SNPs (Single Nucleotide Polymor- 
phisms), which are a very common type of variation in the genome (nearly 10 7 such loci have 
been identified in the human genome). Significant motivation exists in the genetics community 
for minimizing the number of SNPs that must be assayed, and in [52], we demonstrated how 
CUR-type methods may be used to efficiently reconstruct unassayed SNPs from a small number 
of assayed SNPs. 

Both in [52] and here, we consider two regions of the genome known as HOXB and 17q25. 
Three populations were studied for each region: Yoruban, a sub-Saharan African population; a 
European population; and a joint Japanese/Chinese population. Each population had 90 indi- 
viduals, each corresponding to a row of the input matrix. Columns of each matrix correspond 
to SNPs within the HOXB or 17q25 regions. The genotypic data were encoded appropriately 
in order to be converted to numeric data in the form of matrices. (Careful preprocessing was 
done to remove fixed SNPs, as well as SNPs with too many missing entries, etc.) The HapMap 
project provided data on 370 SNPs in 17q25 and 571 SNP in HOXB [15]. Thus, for example, 
our matrix for the Yoruban population in HOXB is a 90 x 571 matrix, whose entries are in the 
set {— 1, 0, +1}. 3 The other data matrices are of similar (not extremely large) size. See [52] and 
references therein for details. 

Data not presented indicate that for all three populations and for both genomic regions, the 
data possess a great deal of linear structure. For example, in the 17q25 matrices, one needs 9, 
9, and 7 singular vectors to capture 80% of the Frobenius norm for the Yoruban, European, and 
the Japanese/Chinese populations, respectively; and one needs 18, 16, and 13 singular vectors, 
respectively, to capture 90%. The matrices for the HOXB region of the genome are even more 
redundant; one needs only 7, 6, and 4 singular vectors, respectively, to capture 80% of the 
Frobenius norm. 

In Figure 1, data are presented for the Yoruban HOXB data matrix. Each of the six subfigures 
presents a plot of the Frobenius norm error as a function of the number c of samples chosen. 
In particular, for two values of the rank parameter, i.e., k = 5 and k = 10, the ratio 0j = 
\\A - A'\\ F /\\A - A k \\ F is plotted, where: A' = CC+A for % = 1; A' = CC+A k for % = 2; and 
A' = CUR for i = 3. Clearly, in all these cases, only modest oversampling is needed to capture 
"nearly all" of the dominant part of the spectrum of the data matrix. For example, for k = 5: if 
c = 5 then ©i = 1.12; if c > 6 then ©i < 1.1; and if c > 9 then ©i < 1.0. Similarly, for k = 10: if 
c = 10 then ©i = 1.22; if c > 15 then ©i < 1.1; and if c > 18 then ©i < 1.0. Similar results hold 
if the projection onto the span of the columns is regularized through a rank-/c space and also if 
rows are chosen after the columns. For example, for k = 10: if c > 16 then ©2 < 1.2 and if c > 30 
then ©2 < 1.1. Similarly, even though the computations for ©3 are slightly worse and somewhat 
noisier due to the second level of sampling (columns and then rows), the results still show that 
only modest oversampling (of c relative to k) is needed. For example, if k = 10, then ©3 < 1.1 
if c > 20 or c > 28, depending on precisely how the columns are chosen. Interestingly, in this 
last case, not only are the plots noisier, but the Expected(c) algorithm and the Exactly(c) 
algorithm seem to lead to (slightly) different results as a function of c. 

Qualitatively similar results are seen for the other populations and the other genomic regions. 
For example, in Figure 2, data are presented for the European population for both the HOXB 
and the 17q25 regions of the genome for the value of the rank parameter k = 10. For the HOXB 
region, ©i = 1.36 if c = 10 (this is higher than for the corresponding Yoruban data), ©1 < 1.0 if 
c > 17 (this is similar to the corresponding Yoruban data), and ©i = 0.62 if c = 30 (this is less 

The encoding should be interpreted as follows: each SNP consists of two alleles (nucleotide bases); these bases 
are the same for all humans. Say that these bases are A and G. Then a value of +1 corresponds to an individual 
whose genotype (pair of alleles) is AA, a value of corresponds to an individual whose genotype is AG or GA, and 
a value of —1 corresponds to an individual whose genotype is GG. 
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Figure 1: Reconstruction error for the Yoruban population in the HOXB region of the genome. 
Shown are ©i, 62, and 63 (as defined in the text) for two values of the rank parameter k. The 
X-axis corresponds to the number of columns sampled with the Exactly(c) algorithm or the 
Expected (c) algorithm. 

than the Yoruban data). Similar trends are seen for 02 and 03 and also for the 17q25 region. In 
all cases, only very modest oversampling is needed for accurate Frobenius norm reconstruction. 
Data not presented indicate that the data for the joint Japanese/Chinese population is quite 
similar or slightly better to those results presented. 

7.3 Recommendation System Jester Data 

Our second dataset comes from the field of recommendation system analysis, in which one is 
typically interested in making purchase recommendations to a user at an electronic commerce web 
site [35] . Collaborative methods (as opposed to content-based or hybrid) involve recommending to 
the user items that people with similar tastes or preferences liked in the past. Many collaborative 
filtering algorithms represent a user as an n dimensional vector, where n is the number of distinct 
products, and where the components of the vector are a measure of the rating provided by that 
user for that product. Thus, for a set of m users, the user-product ratings matrix is an m x n 
matrix A, where Aij is the rating by user i for product j (or is null if the rating is not provided). 

The so-called Jester joke dataset is a commonly-used benchmark for recommendation system 
research and development [35]. In [48], we applied a CUR decomposition on this data to the 
problem of reconstructing missing entries and making accurate recommendations. Here, we con- 
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Figure 2: Reconstruction error for the European population in both the HOXB and 17q25 regions 
of the genome. Shown are Bi, 02, and G3 (as defined in the text) for k = 10. The X-axis corre- 
sponds to the number of columns sampled with the Exactly (c) algorithm or the Expected (c) 
algorithm. 

sider the m = 14, 116 (out of ca. 73, 000) users who rated all of the n = 100 products (i.e., jokes) 
in the Jester data. The entries in this 14, 116 x 100 matrix A are real numbers between —10 and 
+10 that represent the user's rating of a product. 

Figure 3 presents the empirical results for the Jester recommendation system data. The rank 
of the 14, 116 x 100 matrix is 100, and although only 7 singular vectore are needed to capture 
50% of the Frobenius norm, 50 are needed to capture 80%, and 73 are needed to capture 90%. 
Thus, the spectrum and shape of this matrix (this matrix is very rectangular) are very different 
from that of the matrices of the previous subsection. 

Figure 3 presents reconstruction error results for selecting columns (i.e., products or jokes), 
for selecting rows (i.e., users), and for selecting both columns and rows simultaneously. For 
example, when selecting columns from A, if k = 15, then Gi = 1.14 if c = 15, @i < 1 if 
c > 29, and ©1 = 0.99 when c = 30. Although the matrix is very rectangular, quantitatively very 
similar results are obtained for the analogue of ©1 (called ©{^ in the figure) if rows are sampled 
(or, equivalently if columns are sampled from A T ). Thus, when our main CX decomposition is 
applied to either A or to A T ', a small number of columns (products) or rows (users), capture most 
of the Frobenius norm of A that is captured by the best rank k approximation to A. A similar 
result holds for the simultaneously choosing columns and rows of A (both users and products), 



28 




Figure 3: Empirical results for the Jester recommendation system data. Shown are: the percent- 
age of the Frobenius norm captured as a function of the number of singular components; ©i for 
sampling columns from A for k = 5 and k = 15; the analogue of @i for selecting rows from A 
(i.e., 0i for sampling columns from A T ); and ©3 for selecting columns and then rows for k = 5 
and k = 15. 

and applying our CUR approximation algorithm. As with the data of the previous subsection, 
the data for ©3 are much noisier when both columns and rows are chosen, but even in this case 
©3 < 1.1 for k = 5 if c > 25 and ©3 < 1.2 for k = 15 if c > 30. In all these cases, data not 
presented indicate that qualitatively similar (but shifted) results are obtained for higher values 
of the rank parameter k. 

7.4 Term-Document Reuters Data 

Our third data set comes from the field of text categorization and information retrieval. In 
these applications, documents are often represented as a so-called "bag of words" and a vector 
space model is used. In 2000, Reuters Ltd made available a large collection of Reuters News 
stories for use in research and development of natural language processing, information retrieval, 
and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is 
significantly larger (it contains over 800, 000 news items from 1996-97) than the older, well-known 
Reuters-21578 collection which has been heavily used in the text classification community [46]. 
In [16], we considered the problem of feature selection for improved classification, and we compared 
a CX-like column selection procedure to several traditional methods. The data come with class 
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labels and possess a hierarchical class structure (which we used in [16] but) which we ignored 
here. Here, we used the /tc-normalized term-document matrix and the training data from the 
one test-train split provided by Lewis et al. [46]. Thus, the Reuters matrix we considered here is 
a (very sparse) 47, 236 x 23, 149 matrix whose elements are real numbers between and 1 that 
represent a normalized frequency. 

Figure 4 presents the empirical results for the Reuters term-document data. Note that this 
data is not only much larger than the data from the previous two subsections, it is also less well 
approximated by a low rank matrix. Less than 50% of the Frobenius norm is captured by the 
first k = 100 singular components, and less than 80% is captured by the first k = 1500 singular 
components. (Nevertheless, spectral methods have frequently been applied to this data.) The 
matrix is very sparse, and performing computations is expensive in terms of space and time (due 
to multiple randomized trials and since the dense matrices of singular vectors are large) if the 
rank parameter k is chosen to be more than a few hundred. Thus, to demonstrate the empirical 
applicability of our main algorithms, we considered several smaller values of k. Here, we report 
results for: k = 10 and c = 10 to 250; for k = 20 and c = 20 to 500; and for k = 100 and c = 100 
to 700. Note that we report results only for columns chosen with the Expected(c) algorithm; 
initial unreported computations on several smaller systems indicate that very similar results will 
be obtained with the Exactly (c) algorithm. 

In all of these cases, and for all values of @±, 02, and ©3, only modest oversampling leads 
to fairly small reconstruction error. The worst data point reported was for 83 = 1.272 for 
k = 100 and c = 100, and even in that case 63 < 1.1 for c > 300. Interestingly, all the curves 
tend to decrease somewhat more slowly (as a function of oversampling c, relative to k) than the 
corresponding curves in the previous subsections do. Note that 0i does not decrease below 1.0 
for k = 10 until after c = 500; for k = 20, it drops below 1.0 by c = 400, and for k = 100 (which 
obviously captures the largest fraction of the Frobenius norm) it drops below 1.0 at c « 350. 
Thus, this phenomenon is likely related to the degree to which the chosen value for the rank 
parameter k captures a reasonable fraction of the norm of the original matrix. Nevertheless, in 
all cases, we can achieve 0j < 1.1 with only a modest degree of oversampling c relative to k. 

8 Conclusion 

We have presented and analyzed randomized algorithms for computing low-rank matrix approx- 
imations that are explicitly expressed in terms of a small number of columns and/or rows of the 
input matrix. These algorithm achieve relative-error guarantees, whereas previous algorithms for 
these problems achieved only additive-error guarantees. These algorithms randomly sample in a 
novel manner we call "subspace sampling," and their analysis amounts to approximating a gen- 
eralized £2 regression problem by random sampling. As described in Section 1.1 and in [52, 48], 
such low-rank matrix approximations have numerous applications for the improved analysis of 
data. 

We conclude with several open problems. 

• To what extent do the results of this paper generalize to other matrix norms? 

• What hardness results can be established for the optimal choice of columns and/or rows? 

• Does there exist a deterministic approximation algorithm for either of the problems we 
consider? 

• Does there exist an efficient deterministic algorithm to choose columns and/or rows that 
exactly or approximately optimize the maximum volume of the induced parallelepiped? (As 
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Figure 4: Empirical results for the Reuters term-document data. Shown are: the percentage of 
the Frobenius norm captured as a function of the number of singular components; 0i, G2, and 
63 as a function of the number of sampled columns and/or rows for three different values of the 
rank parameter k. 

pointed out to us by an anonymous reviewer, [40, 61] provide such procedures; it would be 
interesting to see if bounds of the form we prove can be established for the algorithms of 
[40, 61]). 

• Can we formulate a simple condition that we can check after we have sampled the columns 
and/or rows to determine whether we have achieved a 1+e approximation with that sample? 

• Can we obtain similar algorithms and comparable bounds for formulations of these problems 
that include regularization and/or conditioning? 

• What heuristic variants of these algorithms are most appropriate in different application 
domains? 

• Are the algorithms presented in this paper numerically stable? 
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A Approximating Matrix Multiplication 



In this section, we describe two complementary procedures for randomly sampling (and rescaling) 
columns and/or rows from an input matrix. Then, we describe an algorithm for approximating 
the product of two matrices by randomly sampling columns and rows from the input matrices 
using one of the two sampling procedures. 

A.l Sampling Columns and Rows from Matrices 

We describe two simple algorithms for randomly sampling a set of columns from an input matrix. 
Each algorithm takes as input an m x n matrix A and a probability distribution {pi}f =1 , and 
each constructs a matrix C consisting of a rescaled copy of small number of columns from A. 
Clearly, each algorithm can be modified to sample rows from a matrix. The first algorithm is the 
Exactly(c) algorithm, which is described in Algorithm 4 using the sampling matrix formalism 
described in Section 2. In this algorithm, c columns exactly of A are chosen in c i.i.d. trials, 
where in each trial the i-th. column of A is picked with probability pi. Note that because the 
sampling is performed with replacement a single column of A may be included in C more than 
once. The second algorithm is the Expected(c) algorithm, which is described in Algorithm 5, 
also using the sampling matrix formalism described in Section 2. In this algorithm, at most c 
columns in expectation of A are chosen by including the i-th column of A in C with probability 
Pi = min{l, cpj}. Note that the exact value of the number of columns returned is not known 
before the execution of this second algorithm; we do not perform an analysis of this random 
variable. 



Data : A G M. mxn , pi > 0,i G [n] s.t. J2ie[n]Pi = 1> positive integer c < n. 

Result : Sampling matrix S, rescaling matrix D, and sampled and rescaled columns C. 

Initialize S and D to the all zeros matrices. 

for t = 1, . . . , c do 



A. 2 Approximate Matrix Multiplication Algorithms 

Algorithm 6 takes as input two matrices A and B, a number c < n, and a probability distribution 
{Pi}i=i over [ n ]- It returns as output two matrices C and R, where the columns of C are a small 
number of sampled and rescaled columns of A and where the rows of R are a small number of 
sampled and rescaled rows of B. The sampling and rescaling are performed by calling either the 
Exactly(c) algorithm or the Expected(c) algorithm. When the Exactly(c) algorithm is used 
to choose column-row pairs in Algorithm 6, this is identical to the algorithm of [21]. In particular, 
note that exactly c column-row pairs are chosen, and a column-row pair could be included in the 
sample more than once. When the Expected (c) algorithm is used to choose column-row pairs 
in Algorithm 6 this is a minor variation of the algorithm of [21]. In particular, the main difference 




end 



C = ASD; 



Algorithm 4: The Exactly (c) algorithm to create S, D, and C. 
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Data : A G M. mxn , pi > 0,i G [n] s.t. J2ie[n]Pi = 1> positive integer c < n. 



Result : Sampling matrix S, rescaling matrix D, and sampled and rescaled columns C. 
Initialize S and D to the all zeros matrices. 



D tt = 1/ min{l, ^fcp]}; 
t = t + l; 
end 
end 

C = ,451}; 



Algorithm 5: The Expected(c) algorithm to create 5, D, and C. 

is that at most c column-row pairs m expectation are chosen, and no column-row pair is included 
in the sample more than once. 



Data : A G R mxn , B G M riXp , fe}™ =1 such that £"=iPi = 1, c < n. 
Result : C € M mxc , i? € M cxp . 

• Form the matrix C = ASD by sampling according to {p«}™ =1 with the Exactly(c) 
algorithm or with the EXPECTED (c) algorithm; 

• Form the matrix R = DS T B from the corresponding rows of B; 



Algorithm 6: A fast Monte-Carlo algorithm for approximate matrix multiplication. 

The next two theorems are our basic quality-of- approximation results for Algorithm 6. Each 
states that, under appropriate assumptions, CR = ASDDS B ~ AB. The most interesting of 
these assumptions is that the sampling probabilities used to randomly sample the columns of A 
and the corresponding rows of B are nonuniform and depend on the product of the Euclidean 
norms of the columns of A and/or the corresponding rows of B. For example, consider sampling 
probabilities {p»}" =1 such that 



for some (3 G (0, 1]. Sampling probabilities of the form (44) use information from the matrices A 
and B in a very particular manner. If (3 = 1, they are optimal for approximating AB by CR in 
a sense made precise in [21]. Alternatively, sampling probabilities {p«}™ =1 such that 





(44) 



E?=i \ A{3) \2 \ B U) 



Pi>P 




(45) 
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for some f3 G (0, 1], are also of interest in approximating the product AB by CR if, e.g., only 
information about A is easily available. 

The following theorem is our main quality-of-approximation result for approximating the prod- 
uct of two matrices with Algorithm 6, when column-row pairs are sampled using the Exactly(c) 
algorithm. Its proof (and the statement and proof of similar stronger results) may be found in 
[21]. 



Theorem 6 Suppose A G M. mxn , B G M nxp ; and c < n. Construct C and R with Algorithm 6, 
using the Exactly(c) algorithm. If the sampling probabilities {pi}™ =1 use d by the algorithm are 
of the form (44) or (45), then 



E [ || AB — Ci2||_p] < 



ffic 
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The following theorem is our main quality-of-approximation result for approximating the prod- 
uct of two matrices with Algorithm 6, when column-row pairs are sampled using the Expected(c) 
algorithm. The Frobenius norm bound (46) is new, and the spectral norm bound (47) is due to 
Rudelson and Vershynin, who proved a similar result in a more general setting [54, 62, 56]. 

Theorem 7 Suppose A G R mxn , B G M™ xp , and c < n. Construct C and R with Algorithm 6, 
using the Expected(c) algorithm. If the sampling probabilities {pi} r ( =1 used by the algorithm are 
of the form (44) or (45), then 



E[||AB-OR|| F ] < 
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(46) 



//, in addition, B = A T , then 



E[||AA -CC T || 2 ] < 0(1 




(47) 



Proof: Equation (47) follows from the analysis of Rudelson and Vershynin, who considered spectral 
norm bounds on approximating the product of two matrices [54, 62, 56]. Note that they considered 
approximating the product AA T by sampling with respect to probabilities of the form (45) with 
(3 = 1, but the analysis for general (3 G (0, 1] is analogous. 

Next, we prove that for any set of probabilities {pi}f =1 the following holds: 
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|AB - CRY 



1 \A (J) 
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(48) 



Equation (46) follows from (48) by using Jensen's inequality and using the form of the sampling 
probabilities (44) and (45). 

To establish (48), recall that the sampling is performed with the Expected (c) algorithm. 
Let Ij , j G [n] be the indicator variable that is set to 1 if the j'-th column of A and the j-th row 
of B are sampled (with probability min{l,cpj}) and is set to otherwise. Recall that if Ij = 1, 
we scale both the j-th column of A and the j-th row of B by 1 / y 7 min{ 1 , cpj } . Thus, 
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(49) 
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Clearly, if min{l, cpj} = 1, then Ij = 1 with probability 1, and 1 — Ij/ min{l, cpj} = 0. Thus, we 
can focus on the set of indices A = { j £ [n] : cpj < 1} C [n]. By taking the expectation of both 
sides of (49), it follows that 
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By multiplying out the right hand side, it follows that 
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Ai 1 j 1 Bj 1 i 2 Ai 1 j 2 Bj 2 i 2 . (50) 



Notice that for j G [A], E [1 - ij/cp,-] = and E (1 - Ij/cpj) 2 = {1/cpj) - 1 < 1/cp,-. Hence, 
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This concludes the proof of (48) and thus of the theorem 
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